parsermd

An R package that implements a C++-based parser and abstract syntax tree (AST) for Quarto and R Markdown documents.

  • supports manipulating ASTs (filtering, editing, etc.)

  • node classes use S7 for validation and dispatch

  • ability to directly source and render ASTs

  • off-and-on project since COVID; the original use case was to aid in grading for a large machine learning course

  • v0.1.3 is on CRAN, v0.2.0 with full Quarto support on GitHub (CRAN soon*)

  • Quarto examples today, but everything works with R Markdown

hello.qmd

---
title: "Hello, Quarto"
format:
  html:
    self-contained: true
---
  
```{r}
#| label: load-packages
#| include: false

library(tidyverse)
library(palmerpenguins)
```

## Meet Quarto

Quarto enables you to weave together content and executable code into a finished document. 
To learn more about Quarto see <https://quarto.org>.

## Meet the penguins

![](https://raw.githubusercontent.com/quarto-dev/quarto-web/main/docs/get-started/hello/rstudio/lter_penguins.png){style="float:right;" fig-alt="Illustration of three species of Palmer Archipelago penguins: Chinstrap, Gentoo, and Adelie. Artwork by @allison_horst." width="401"}

The `penguins` data from the [**palmerpenguins**](https://allisonhorst.github.io/palmerpenguins "palmerpenguins R package") 
package contains size measurements for `{r} nrow(penguins)` penguins from three species
observed on three islands in the Palmer Archipelago, Antarctica.

The plot below shows the relationship between flipper and bill lengths of these penguins.

```{r}
#| label: plot-penguins
#| warning: false
#| echo: false

ggplot(penguins, 
       aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = species, shape = species)) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(
    title = "Flipper and bill length",
    subtitle = "Dimensions for penguins at Palmer Station LTER",
    x = "Flipper length (mm)", y = "Bill length (mm)",
    color = "Penguin species", shape = "Penguin species"
  ) +
  theme_minimal()
```

## Other Quarto features

### Fenced divs

:::{.callout-note}
Note that there are five types of callouts, including: 
`note`, `tip`, `warning`, `caution`, and `important`.
:::

### Markdown code blocks

Some sample python code,

```python
import numpy as np
import matplotlib.pyplot as plt

r = np.arange(0, 2, 0.01)
theta = 2 * np.pi * r
fig, ax = plt.subplots(
  subplot_kw = {'projection': 'polar'} 
)
ax.plot(theta, r)
ax.set_rticks([0.5, 1, 1.5, 2])
ax.grid(True)
plt.show()
```

### Short codes

Shortcodes are special markdown directives that generate various types of content,

{{< lipsum 1 >}}

Elements as AST

qmd = parse_qmd("hello.qmd")
qmd |> print(flat = TRUE)
├── YAML [2 fields]
├── Markdown [1 line]
├── Chunk [r, 4 lines] - load-packages
├── Heading [h2] - Meet Quarto
├── Markdown [1 line]
├── Heading [h2] - Meet the penguins
├── Markdown [7 lines]
├── Chunk [r, 12 lines] - plot-penguins
├── Heading [h2] - Other Quarto features
├── Heading [h3] - Fenced divs
├── Open Fenced div [.callout-note]
├── Markdown [2 lines]
├── Close Fenced div 
├── Heading [h3] - Markdown code blocks
├── Markdown [1 line]
├── Code block [python, 12 lines]
├── Heading [h3] - Short codes
└── Markdown [3 lines]
qmd |> print()
├── YAML [2 fields]
├── Markdown [1 line]
├── Chunk [r, 4 lines] - load-packages
├── Heading [h2] - Meet Quarto
│   └── Markdown [1 line]
├── Heading [h2] - Meet the penguins
│   ├── Markdown [7 lines]
│   └── Chunk [r, 12 lines] - plot-penguins
└── Heading [h2] - Other Quarto features
    ├── Heading [h3] - Fenced divs
    │   ├── Open Fenced div [.callout-note]
    │   │   └── Markdown [2 lines]
    │   └── Close Fenced div 
    ├── Heading [h3] - Markdown code blocks
    │   ├── Markdown [1 line]
    │   └── Code block [python, 12 lines]
    └── Heading [h3] - Short codes
        └── Markdown [3 lines]

Why hierarchical?

Assuming a hierarchy lets us use a CSS-selector-like approach to target specific nodes based on headings and their descendants,

qmd |> rmd_select(by_section("Meet Quarto"))
├── YAML [2 fields]
└── Heading [h2] - Meet Quarto
    └── Markdown [1 line]
qmd |> rmd_select(by_section("Fenced divs"))
├── YAML [2 fields]
└── Heading [h3] - Fenced divs
    ├── Open Fenced div [.callout-note]
    │   └── Markdown [2 lines]
    └── Close Fenced div 
qmd |> 
  rmd_select(by_section(c("Other Quarto features", "*code*")))
├── YAML [2 fields]
└── Heading [h2] - Other Quarto features
    ├── Heading [h3] - Markdown code blocks
    │   ├── Markdown [1 line]
    │   └── Code block [python, 12 lines]
    └── Heading [h3] - Short codes
        └── Markdown [3 lines]
qmd |> 
  rmd_select(
    by_section(
      c("Other Quarto features", "*code*"), 
      keep_parents = FALSE
    ), 
    keep_yaml = FALSE
  )
├── Heading [h3] - Markdown code blocks
│   ├── Markdown [1 line]
│   └── Code block [python, 12 lines]
└── Heading [h3] - Short codes
    └── Markdown [3 lines]
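
The wildcard pattern "*code*" selected both "Markdown code blocks" and "Short codes" above, since each heading contains the substring "code". The sketch below reproduces that matching with base R's glob2rx(). It is only an illustration of glob semantics: whether parsermd uses glob2rx() internally is an assumption, and its matching may differ in details such as case sensitivity.

```r
# Illustration only: base R glob matching, which may differ from
# parsermd's internal implementation.
headings = c("Fenced divs", "Markdown code blocks", "Short codes")

# glob2rx() converts a wildcard pattern into an anchored regular expression
pattern = utils::glob2rx("*code*")

# TRUE for every heading that contains "code"
matches = grepl(pattern, headings)
setNames(matches, headings)
```

"Fenced divs" does not match, which is why that section is absent from the selection.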

as_document()

ASTs and nodes can be converted back to Quarto documents,

qmd |> 
  rmd_select(by_section("Meet Quarto")) |>
  as_document() |>
  cat(sep = "\n")
---
title: Hello, Quarto
format:
  html:
    self-contained: true
---

## Meet Quarto

Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see <https://quarto.org>.
qmd |> 
  rmd_select(by_section("Fenced divs")) |>
  as_document() |>
  cat(sep = "\n")
---
title: Hello, Quarto
format:
  html:
    self-contained: true
---

### Fenced divs

::: {.callout-note}

Note that there are five types of callouts, including: 
`note`, `tip`, `warning`, `caution`, and `important`.


:::

qmd |> 
  rmd_select(
    by_section(c("Other Quarto features", "*code*"), 
               keep_parents = FALSE),
    keep_yaml = FALSE
  ) |>
  as_document() |>
  cat(sep = "\n")
### Markdown code blocks

Some sample python code,

``` python
import numpy as np
import matplotlib.pyplot as plt

r = np.arange(0, 2, 0.01)
theta = 2 * np.pi * r
fig, ax = plt.subplots(
  subplot_kw = {'projection': 'polar'} 
)
ax.plot(theta, r)
ax.set_rticks([0.5, 1, 1.5, 2])
ax.grid(True)
plt.show()
```

### Short codes

Shortcodes are special markdown directives that generate various types of content,

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis sagittis posuere ligula sit amet lacinia. Duis dignissim pellentesque magna, rhoncus congue sapien finibus mollis. Ut eu sem laoreet, vehicula ipsum in, convallis erat. Vestibulum magna sem, blandit pulvinar augue sit amet, auctor malesuada sapien. Nullam faucibus leo eget eros hendrerit, non laoreet ipsum lacinia. Curabitur cursus diam elit, non tempus ante volutpat a. Quisque hendrerit blandit purus non fringilla. Integer sit amet elit viverra ante dapibus semper. Vestibulum viverra rutrum enim, at luctus enim posuere eu. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.

Additional selectors

Still a work in progress, but all of the following selectors are currently supported:

  • by_section()

  • has_type()

  • has_label()

  • has_heading()

  • has_option()

  • has_shortcode()

  • by_fdiv()

qmd |> rmd_select("plot-penguins")
├── YAML [2 fields]
└── Chunk [r, 12 lines] - plot-penguins
qmd |> rmd_select(has_label("*p*"))
├── YAML [2 fields]
├── Chunk [r, 4 lines] - load-packages
└── Chunk [r, 12 lines] - plot-penguins
qmd |> rmd_select(has_type(c("rmd_yaml", "rmd_chunk")))
├── YAML [2 fields]
├── Chunk [r, 4 lines] - load-packages
└── Chunk [r, 12 lines] - plot-penguins
qmd |> rmd_select(!has_type("rmd_markdown"))
├── YAML [2 fields]
├── Chunk [r, 4 lines] - load-packages
├── Heading [h2] - Meet Quarto
├── Heading [h2] - Meet the penguins
│   └── Chunk [r, 12 lines] - plot-penguins
└── Heading [h2] - Other Quarto features
    ├── Heading [h3] - Fenced divs
    │   ├── Open Fenced div [.callout-note]
    │   └── Close Fenced div 
    ├── Heading [h3] - Markdown code blocks
    │   └── Code block [python, 12 lines]
    └── Heading [h3] - Short codes

Rendering

ASTs can also be rendered directly,

qmd |> render("hello_quarto")
qmd |> rmd_select(has_type("rmd_chunk")) |> render("hello_quarto_code")

Modifying ASTs

rmd_modify() is a recent addition that allows ASTs to be modified in place. Its arguments are a node-modifying function followed by one or more rmd_select() helper functions that pick the nodes to modify.

qmd |> 
  rmd_select(has_type("rmd_chunk")) |>
  as_document() |>
  cat(sep="\n")
---
title: Hello, Quarto
format:
  html:
    self-contained: true
---

```{r}
#| label: load-packages


library(tidyverse)
library(palmerpenguins)
```

```{r}
#| label: plot-penguins
#| warning: false
#| echo: false

ggplot(penguins, 
       aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = species, shape = species)) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(
    title = "Flipper and bill length",
    subtitle = "Dimensions for penguins at Palmer Station LTER",
    x = "Flipper length (mm)", y = "Bill length (mm)",
    color = "Penguin species", shape = "Penguin species"
  ) +
  theme_minimal()
```
qmd |>
  rmd_select(has_type("rmd_chunk")) |>
  rmd_modify(
    function(x) {
      rmd_node_options(x) = list(echo=TRUE, message=FALSE)
      x
    },
    has_type("rmd_chunk")
  ) |>
  as_document() |>
  cat(sep="\n")
---
title: Hello, Quarto
format:
  html:
    self-contained: true
---

```{r}
#| label: load-packages
#| echo: true
#| message: false


library(tidyverse)
library(palmerpenguins)
```

```{r}
#| label: plot-penguins
#| warning: false
#| echo: true
#| message: false

ggplot(penguins, 
       aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = species, shape = species)) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(
    title = "Flipper and bill length",
    subtitle = "Dimensions for penguins at Palmer Station LTER",
    x = "Flipper length (mm)", y = "Bill length (mm)",
    color = "Penguin species", shape = "Penguin species"
  ) +
  theme_minimal()
```

Example Workflow


One file to rule them all

Problem statement

I distribute my assignments as GitHub repos that contain a README.md, an hw1.qmd file, and other file-based infrastructure.

I inevitably end up having to maintain both hw1/ and hw1-key/ versions of the assignment.

  • Different repos for different audiences: students vs TAs

  • Repos have a tendency to drift over time

  • Single repo with student scaffolding and solution code is ideal for maintenance but clunky for actual work

hw1.qmd

---
title: "Homework 3 - Data Analysis with R"
author: "Your Name"
date: "Due: Friday, March 15, 2024"
format: html
execute:
  warning: false
  message: false
---

## Setup

Load the required packages for this assignment:

```{r setup}
library(tidyverse)
library(palmerpenguins)
```

## Exercise 1: Basic Data Exploration

Examine the `penguins` dataset from the `palmerpenguins` package. Your task is to create a summary of the dataset that shows the number of observations and variables, and identify any missing values.

```{r ex1-student}
# Write your code here to:
# 1. Display the dimensions of the penguins dataset
# 2. Show the structure of the dataset
# 3. Count missing values in each column

```

```{r ex1-key}
# Solution: Basic data exploration
# 1. Display dimensions
cat("Dataset dimensions:", dim(penguins), "\n")
cat("Rows:", nrow(penguins), "Columns:", ncol(penguins), "\n\n")

# 2. Show structure
str(penguins)

# 3. Count missing values
cat("\nMissing values by column:\n")
penguins %>%
  summarise(across(everything(), ~ sum(is.na(.))))
```

## Exercise 2: Data Visualization

Create a scatter plot showing the relationship between flipper length and body mass for penguins. Color the points by species and add appropriate labels and a title.

```{r ex2-student}
# Create a scatter plot with:
# - x-axis: flipper_length_mm
# - y-axis: body_mass_g
# - color by species
# - add appropriate labels and title

ggplot(data = penguins, aes(x = ___, y = ___)) +
  geom_point(aes(color = ___)) +
  labs(
    title = "___",
    x = "___",
    y = "___"
  )
```

```{r ex2-key}
# Solution: Scatter plot of flipper length vs body mass
ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species), alpha = 0.8, size = 2) +
  labs(
    title = "Penguin Flipper Length vs Body Mass by Species",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)",
    color = "Species"
  ) +
  theme_minimal() +
  scale_color_viridis_d()
```

## Exercise 3: Statistical Analysis

Calculate summary statistics for bill length by species. Create a table showing the mean, median, standard deviation, and count for each species.

```{r ex3-student}
# Calculate summary statistics for bill_length_mm by species
# Include: mean, median, standard deviation, and count
# Remove missing values before calculating

penguins %>%
  # Add your code here

```

```{r ex3-key}
# Solution: Summary statistics for bill length by species
penguins %>%
  filter(!is.na(bill_length_mm)) %>%
  group_by(species) %>%
  summarise(
    count = n(),
    mean_bill_length = round(mean(bill_length_mm), 2),
    median_bill_length = round(median(bill_length_mm), 2),
    sd_bill_length = round(sd(bill_length_mm), 2),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_bill_length))
```

## Exercise 4: Advanced Data Manipulation

Filter the dataset to include only penguins with complete data (no missing values), then create a new variable called `bill_ratio` that represents the ratio of bill length to bill depth. Finally, identify which species has the highest average bill ratio.

```{r ex4-student}
# Step 1: Filter for complete cases
# Step 2: Create bill_ratio variable (bill_length_mm / bill_depth_mm)
# Step 3: Calculate average bill_ratio by species
# Step 4: Identify species with highest average ratio

```

```{r ex4-key}
# Solution: Advanced data manipulation
complete_penguins = penguins %>%
  # Remove rows with any missing values
  filter(complete.cases(.)) %>%
  # Create bill_ratio variable
  mutate(bill_ratio = bill_length_mm / bill_depth_mm)

# Calculate average bill ratio by species
bill_ratio_summary = complete_penguins %>%
  group_by(species) %>%
  summarise(
    avg_bill_ratio = round(mean(bill_ratio), 3),
    n = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_bill_ratio))

print(bill_ratio_summary)

# Identify species with highest average bill ratio
highest_ratio_species = bill_ratio_summary %>%
  slice_max(avg_bill_ratio, n = 1) %>%
  pull(species)

cat("\nSpecies with highest average bill ratio:", as.character(highest_ratio_species))
```

## Bonus Exercise: Conditional Logic

Write a function that categorizes penguins as "small", "medium", or "large" based on their body mass. Use the following criteria:
- Small: body mass < 3500g
- Medium: body mass between 3500g and 4500g  
- Large: body mass > 4500g

Apply this function to create a new column and create a summary table.

```{r bonus-student}
# Create a function to categorize penguins by size
categorize_size = function(mass) {
  # Add your conditional logic here
  
}

# Apply the function and create summary
```

```{r bonus-key}
# Solution: Conditional logic for size categorization
categorize_size = function(mass) {
  case_when(
    is.na(mass) ~ "Unknown",
    mass < 3500 ~ "Small",
    mass >= 3500 & mass <= 4500 ~ "Medium",
    mass > 4500 ~ "Large"
  )
}

# Apply the function and create summary
penguins_with_size = penguins %>%
  mutate(size_category = categorize_size(body_mass_g))

# Create summary table
size_summary = penguins_with_size %>%
  count(species, size_category) %>%
  pivot_wider(names_from = size_category, values_from = n, values_fill = 0)

print(size_summary)

# Overall size distribution
penguins_with_size %>%
  count(size_category) %>%
  mutate(percentage = round(n / sum(n) * 100, 1))
```
(hw1 = parse_qmd("hw1.qmd"))
├── YAML [5 fields]
├── Heading [h2] - Setup
│   ├── Markdown [1 line]
│   └── Chunk [r, 2 lines] - setup
├── Heading [h2] - Exercise 1: Basic Data Exploration
│   ├── Markdown [1 line]
│   ├── Chunk [r, 5 lines] - ex1-student
│   └── Chunk [r, 12 lines] - ex1-key
├── Heading [h2] - Exercise 2: Data Visualization
│   ├── Markdown [1 line]
│   ├── Chunk [r, 13 lines] - ex2-student
│   └── Chunk [r, 11 lines] - ex2-key
├── Heading [h2] - Exercise 3: Statistical Analysis
│   ├── Markdown [1 line]
│   ├── Chunk [r, 7 lines] - ex3-student
│   └── Chunk [r, 12 lines] - ex3-key
├── Heading [h2] - Exercise 4: Advanced Data Manipulation
│   ├── Markdown [1 line]
│   ├── Chunk [r, 5 lines] - ex4-student
│   └── Chunk [r, 25 lines] - ex4-key
└── Heading [h2] - Bonus Exercise: Conditional Logic
    ├── Markdown [6 lines]
    ├── Chunk [r, 7 lines] - bonus-student
    └── Chunk [r, 25 lines] - bonus-key

Versions

Student

hw1 |>
  rmd_select(
    !has_label("*-key")
  ) |>
  rmd_modify(
    function(x) {
      rmd_node_label(x) = stringr::str_remove(
        rmd_node_label(x), "-student"
      )
      x
    },
    has_label("*-student")
  )
├── YAML [5 fields]
├── Heading [h2] - Setup
│   ├── Markdown [1 line]
│   └── Chunk [r, 2 lines] - setup
├── Heading [h2] - Exercise 1: Basic Data Exploration
│   ├── Markdown [1 line]
│   └── Chunk [r, 5 lines] - ex1
├── Heading [h2] - Exercise 2: Data Visualization
│   ├── Markdown [1 line]
│   └── Chunk [r, 13 lines] - ex2
├── Heading [h2] - Exercise 3: Statistical Analysis
│   ├── Markdown [1 line]
│   └── Chunk [r, 7 lines] - ex3
├── Heading [h2] - Exercise 4: Advanced Data Manipulation
│   ├── Markdown [1 line]
│   └── Chunk [r, 5 lines] - ex4
└── Heading [h2] - Bonus Exercise: Conditional Logic
    ├── Markdown [6 lines]
    └── Chunk [r, 7 lines] - bonus

TA

hw1 |>
  rmd_select(
    has_heading(c("Exercise *", "Bonus*")),
    has_label(c("*-key", "setup"))
  ) |>
  rmd_modify(
    function(x) {
      rmd_node_options(x) = list(include = FALSE)
      x
    },
    has_label("setup")
  )
├── YAML [5 fields]
├── Chunk [r, 2 lines] - setup
├── Heading [h2] - Exercise 1: Basic Data Exploration
│   └── Chunk [r, 12 lines] - ex1-key
├── Heading [h2] - Exercise 2: Data Visualization
│   └── Chunk [r, 11 lines] - ex2-key
├── Heading [h2] - Exercise 3: Statistical Analysis
│   └── Chunk [r, 12 lines] - ex3-key
├── Heading [h2] - Exercise 4: Advanced Data Manipulation
│   └── Chunk [r, 25 lines] - ex4-key
└── Heading [h2] - Bonus Exercise: Conditional Logic
    └── Chunk [r, 25 lines] - bonus-key
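
With both versions available as ASTs, the remaining step is writing each one into its own repo directory. A minimal sketch, assuming as_document() returns a character vector of lines (as the cat(sep = "\n") calls earlier suggest). write_version() and the directory names are hypothetical, not part of parsermd, and `student` / `key` stand for the results of the two pipelines above.

```r
# Hypothetical helper (not part of parsermd): write document lines to
# <dir>/<file>, creating the directory if needed, and return the path.
write_version = function(lines, dir, file = "hw1.qmd") {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  path = file.path(dir, file)
  writeLines(lines, path)
  invisible(path)
}

# Intended usage, assuming `student` and `key` hold the two ASTs:
#   student |> as_document() |> write_version("hw1")
#   key     |> as_document() |> write_version("hw1-key")

# Stand-in lines so the sketch runs without parsermd installed
demo_lines = c("---", "title: demo", "---")
demo_path = write_version(demo_lines, file.path(tempdir(), "hw1"))
readLines(demo_path)
```

Keeping the write step separate from the selection pipelines means the single source hw1.qmd stays the only file edited by hand.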

What’s next?

  • The current version will be going up on CRAN soon (revdep checks still need work, other minor polishing)

  • Building out and documenting interesting use cases

  • Building out tools using this infrastructure

  • Improved ergonomics

Sneak peek - markermd

Reach out